NSF PAR Search | NSF Public Access Repository

Learning from Uncertain Data: From Possible Worlds to Possible Models

Zhu, Jiongli; Feng, Su; Glavic, Boris; Salimi, Babak (February 2025, NeurIPS 2024)

We introduce an efficient method for learning linear models from uncertain data, where uncertainty is represented as a set of possible variations in the data, leading to predictive multiplicity. Our approach leverages abstract interpretation and zonotopes, a type of convex polytope, to compactly represent these dataset variations, enabling the symbolic execution of gradient descent on all possible worlds simultaneously. We develop techniques to ensure that this process converges to a fixed point and derive closed-form solutions for this fixed point. Our method provides sound over-approximations of all possible optimal models and viable prediction ranges. We demonstrate the effectiveness of our approach through theoretical and empirical analysis, highlighting its potential to reason about model and prediction uncertainty due to data quality issues in training data.
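As a rough illustration of what a sound prediction range means, the sketch below bounds the output of a linear model whose weights are known only up to a range. It uses plain interval arithmetic, a simpler and coarser abstract domain than the zonotopes named in the abstract, and it is not the authors' method. All names (`Interval`, `scale`, `add`, `prediction_range`) and the example numbers are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def scale(iv: Interval, c: float) -> Interval:
    """Multiply an interval by a known scalar; a negative scalar swaps the ends."""
    lo, hi = iv
    return (c * lo, c * hi) if c >= 0 else (c * hi, c * lo)


def add(a: Interval, b: Interval) -> Interval:
    """Add two intervals end by end."""
    return (a[0] + b[0], a[1] + b[1])


def prediction_range(weights: List[Interval], bias: Interval, x: List[float]) -> Interval:
    """Bound w.x + b over every weight vector inside the given box."""
    out = bias
    for w, xi in zip(weights, x):
        out = add(out, scale(w, xi))
    return out


# Hypothetical set of possible models: each weight is known only up to a range.
possible_weights = [(1.0, 2.0), (-1.0, 0.5)]
possible_bias = (0.0, 0.0)
print(prediction_range(possible_weights, possible_bias, [2.0, -1.0]))
```

Intervals treat each weight independently, so they cannot express correlations between weights. Zonotopes can, which is one reason they may yield tighter ranges.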
Free, publicly-accessible full text available February 13, 2026
FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds

https://doi.org/10.1145/3709691

Huber, Aaron; Kennedy, Oliver; Rudra, Atri; Zhao, Zhuoyue; Feng, Su; Glavic, Boris (February 2025, Proceedings of the ACM on Management of Data)

Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. In this work, we study computing expected multiplicities of query results over probabilistic databases under bag semantics, which has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical and a systems perspective. We employ concepts from fine-grained complexity to demonstrate that exact bag probabilistic query processing is fundamentally less efficient than deterministic bag query evaluation, but that fast approximations are possible by sampling monomials from a circuit representation of a result tuple's lineage. A remaining issue, however, is that constructing such circuits, while in PTIME, can nonetheless have significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FastPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.
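The idea of estimating an expected multiplicity by sampling monomials can be sketched on a toy lineage polynomial. This is a minimal illustration, not FastPDB's algorithm. It assumes the lineage is already expanded into a list of monomials with positive integer coefficients (the paper samples from a circuit or directly from the data instead), and that each variable is an independent 0/1 random variable. The names `exact_expectation` and `sampled_expectation` and the example data are hypothetical.

```python
import random
from math import prod

# A monomial is (coefficient, variables); variables may repeat, e.g. x*x.
# Assumption: every variable is an independent 0/1 (Bernoulli) variable,
# so x**k has the same expectation as x and repeats can be dropped.


def monomial_expectation(variables, prob):
    """Expected value of a product of independent 0/1 variables."""
    return prod(prob[v] for v in set(variables))


def exact_expectation(monomials, prob):
    """Sum the expectation of every monomial (linearity of expectation)."""
    return sum(c * monomial_expectation(vs, prob) for c, vs in monomials)


def sampled_expectation(monomials, prob, n, rng):
    """Unbiased estimate: draw monomials in proportion to their coefficient."""
    total = sum(c for c, _ in monomials)
    picks = rng.choices(monomials, weights=[c for c, _ in monomials], k=n)
    mean = sum(monomial_expectation(vs, prob) for _, vs in picks) / n
    return total * mean


lineage = [(2, ("x", "y")), (1, ("x", "x"))]  # 2xy + x^2
prob = {"x": 0.5, "y": 0.5}
print(exact_expectation(lineage, prob))
print(sampled_expectation(lineage, prob, 20000, random.Random(0)))
```

Because the estimate is a running average over samples, it can be reported at any point and refined as more samples arrive, which is the sense in which such an approximation is "anytime".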
Free, publicly-accessible full text available February 10, 2026
Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data

https://doi.org/10.14778/3583140.3583151

Feng, Su; Glavic, Boris; Kennedy, Oliver (February 2023, Proceedings of the VLDB Endowment)

Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is to the best of our knowledge the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and windowed aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors, and often produces more accurate results.
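To make under- and over-approximation of a top-k result concrete, the sketch below bounds each tuple's sort position when its sort key is known only as a range. It is a naive quadratic illustration under assumed names (`rank_bounds`, `topk_status`) and an assumed input format, not the physical operators the authors implemented in PostgreSQL.

```python
from typing import Dict, Set, Tuple


def rank_bounds(values: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[int, int]]:
    """For an ascending sort, bound each tuple's 0-based position.

    values maps a tuple id to the (lowest, highest) possible sort key.
    """
    bounds = {}
    for key, (lo, hi) in values.items():
        others = [v for k, v in values.items() if k != key]
        # Tuples whose highest key is below our lowest key always come first.
        certainly_before = sum(1 for (_, o_hi) in others if o_hi < lo)
        # Tuples whose lowest key is at most our highest key may come first.
        possibly_before = sum(1 for (o_lo, _) in others if o_lo <= hi)
        bounds[key] = (certainly_before, possibly_before)
    return bounds


def topk_status(bounds: Dict[str, Tuple[int, int]], k: int) -> Tuple[Set[str], Set[str]]:
    """Return (tuples certainly in the top k, tuples possibly in the top k)."""
    certain = {key for key, (_, worst) in bounds.items() if worst < k}
    possible = {key for key, (best, _) in bounds.items() if best < k}
    return certain, possible


keys = {"a": (1.0, 2.0), "b": (3.0, 5.0), "c": (4.0, 6.0)}
bounds = rank_bounds(keys)
print(bounds)
print(topk_status(bounds, 2))
```

The certain set under-approximates the top-k answer and the possible set over-approximates it, so the true answer in any possible world lies between the two.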
Full Text Available
Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds

https://doi.org/10.1145/3448016.3452791

Feng, Su; Glavic, Boris; Huber, Aaron; Kennedy, Oliver A. (June 2021, SIGMOD '21: International Conference on Management of Data)
Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. Unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
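A minimal sketch of attribute-level bounds, assuming each value carries a lower bound, a best guess, and an upper bound. It shows how a sum and a comparison can propagate such bounds. It deliberately ignores the tuple-level annotations the abstract also mentions, and the names `RangeValue`, `bounded_sum`, and `greater_than` are hypothetical rather than taken from the AU-DB implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RangeValue:
    """An attribute value known only up to a range, plus a best guess."""
    lo: float
    best: float
    hi: float

    def __add__(self, other: "RangeValue") -> "RangeValue":
        # Addition is monotone, so each bound is the sum of the matching bounds.
        return RangeValue(self.lo + other.lo, self.best + other.best, self.hi + other.hi)


def bounded_sum(values: Iterable[RangeValue]) -> RangeValue:
    """Aggregate a column of range values into one range value."""
    total = RangeValue(0.0, 0.0, 0.0)
    for v in values:
        total = total + v
    return total


def greater_than(v: RangeValue, threshold: float) -> Tuple[bool, bool, bool]:
    """Return (certainly true, true for the best guess, possibly true)."""
    return (v.lo > threshold, v.best > threshold, v.hi > threshold)


column = [RangeValue(1.0, 2.0, 3.0), RangeValue(0.0, 5.0, 10.0)]
print(bounded_sum(column))
print(greater_than(column[0], 2.5))
```

The result of each operation is again a range value, which is the sense in which such a model can stay closed under further query operators.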
Full Text Available
DataSense: Display Agnostic Data Documentation

Kumari, Poonam; Brachmann, Michael; Kennedy, Oliver; Feng, Su; Glavic, Boris (January 2021, Conference on Innovative Data Systems Research)
Boncz, Peter; Ozcan, Fatma; Patel, Jignesh (Ed.)
Documentation of data is critical for understanding the semantics of data, understanding how data was created, and for raising awareness of data quality problems, errors, and assumptions. However, manually creating, maintaining, and exploring documentation is time-consuming and error-prone. In this work, we present our vision for display-agnostic data documentation (DAD), a novel data management paradigm that aids users in dealing with documentation for data. We introduce DataSense, a system implementing the DAD paradigm. Specifically, DataSense supports multiple types of documentation from free-form text to structured information like provenance and uncertainty annotations, as well as several display formats for documentation. DataSense automatically computes documentation for derived data. A user study we conducted with uncertainty documentation produced by DataSense demonstrates the benefits of documentation management.
Full Text Available
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers

https://doi.org/10.1145/3299869.3319887

Feng, Su; Huber, Aaron; Glavic, Boris; Kennedy, Oliver (June 2019, SIGMOD)

Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
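The combination of an under- and an over-approximation can be sketched for a simple case. The example assumes set semantics, a monotone query (here a selection), and an input where each row of one chosen "best guess" world is flagged as certain or not. Under those assumptions, running the query on only the certain rows yields answers that hold in every world. The function names and data are hypothetical, and this is not the K-relation construction from the paper.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, int]


def ua_answers(
    best_guess: List[Row],
    is_certain: List[bool],
    query: Callable[[List[Row]], List[Row]],
) -> List[Tuple[Row, bool]]:
    """Answer the query on the best-guess world and mark answers known to be certain.

    Assumes `query` is monotone: adding input rows never removes an answer.
    """
    certain_input = [row for row, flag in zip(best_guess, is_certain) if flag]
    # Under-approximation of the certain answers.
    under_approx = set(query(certain_input))
    # Every best-guess answer is returned; the flag marks the certain ones.
    return [(row, row in under_approx) for row in query(best_guess)]


def cheap_items(rows: List[Row]) -> List[Row]:
    """A monotone example query: select rows with price below 10."""
    return [r for r in rows if r[1] < 10]


world = [("pen", 2), ("ink", 8), ("desk", 90)]
flags = [True, False, True]  # the "ink" row is uncertain
print(ua_answers(world, flags, cheap_items))
```

Answers flagged `False` are not discarded. They are returned with an explicit marker, which mirrors the abstract's point about including some answers that are not certain.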
Full Text Available
Data Debugging and Exploration with Vizier

https://doi.org/10.1145/3299869.3320246

Brachmann, Mike; Spoth, William; Yang, Ying; Bautista, Carlos; Castelo, Sonia; Feng, Su; Freire, Juliana; Glavic, Boris; Kennedy, Oliver; Müeller, Heiko; et al (January 2019, SIGMOD)

We present Vizier, a multi-modal data exploration and debugging tool. The system supports a wide range of operations by seamlessly integrating Python, SQL, and automated data curation and debugging methods. Using Spark as an execution backend, Vizier handles large datasets in multiple formats. Ease of use is attained through integration of a notebook with a spreadsheet-style interface and with visualizations that guide and support the user in the loop. In addition, native support for provenance and versioning enables collaboration and uncertainty management. In this demonstration, we will illustrate the diverse features of the system using several realistic data science tasks based on real data.
Full Text Available